Abstract

The statistical comparison of multiple algorithms over multiple data sets is fundamental in 
machine learning. This is typically carried out by the Friedman test. When the Friedman 
test rejects the null hypothesis, multiple comparisons are carried out to establish which 
are the significant differences among algorithms. The multiple comparisons are usually 
performed using the mean-ranks test. The aim of this technical note is to discuss the 
inconsistencies of the mean-ranks post-hoc test with the goal of discouraging its use in 
machine learning as well as in medicine, psychology, etc. We show that the outcome of the
mean-ranks test depends on the pool of algorithms originally included in the experiment. 
In other words, the outcome of the comparison between algorithms A and B depends also 
on the performance of the other algorithms included in the original experiment. This can 
lead to paradoxical situations. For instance, the difference between A and B could be
declared significant if the pool comprises algorithms C, D, E and not significant if the pool
comprises algorithms F, G, H. To overcome these issues, we suggest instead performing the
multiple comparison using a test whose outcome depends only on the two algorithms being
compared, such as the sign test or the Wilcoxon signed-rank test.

Keywords: statistical comparison, Friedman test, post-hoc test 


1. Introduction 


The statistical comparison of multiple algorithms over multiple data sets is fundamental in 
machine learning; it is typically carried out by means of a statistical test. The recommended
approach is the Friedman test (Demšar, 2006). Being non-parametric, it does not require
commensurability of the measures across different data sets, it does not assume normality
of the sample means and it is robust to outliers.

When the Friedman test rejects the null hypothesis of no difference among the algorithms,
post-hoc analysis is carried out to assess which differences are significant. A series
of pairwise comparisons is performed, adjusting the significance level via Bonferroni
correction or other more powerful approaches (Demšar, 2006; Garcia and Herrera, 2008) to
control the family-wise Type I error.
The mean-ranks post-hoc test (McDonald and Thompson, 1967; Nemenyi, 1963) is recommended
as pairwise test for multiple comparisons in most books of nonparametric statistics:
see for instance (Gibbons and Chakraborti, 2011, Sec. 12.2.1), (Kvam and Vidakovic,
2007, Sec. 8.2) and (Sheskin, 2003, Sec. 25.2). It is also commonly used in machine learning
(Demšar, 2006; Garcia and Herrera, 2008). The mean-ranks test is based on the statistic:


$$
z = \frac{|\bar{R}_A - \bar{R}_B|}{\sqrt{\dfrac{m(m+1)}{6n}}}
$$


where $\bar{R}_A$, $\bar{R}_B$ are the mean ranks (as computed by the Friedman test) of algorithms A and
B, m is the number of algorithms to be compared and n the number of datasets. The mean
ranks $\bar{R}_A$, $\bar{R}_B$ are computed considering the performance of all the m algorithms. Thus the
outcome of the comparison between A and B depends also on the performance of the
other $(m-2)$ algorithms included in the original experiment. This can lead to paradoxical
situations. For instance, the difference between A and B could be declared significant if
the pool comprises algorithms C, D, E and not significant if the pool comprises algorithms
F, G, H. The performance of the remaining algorithms should instead be irrelevant when
comparing algorithms A and B. This problem has been pointed out several times in the past
(Miller, 1966; Gabriel, 1969; Fligner, 1984) and also in (Hollander et al., 2013, Sec. 7.3).
Yet it is ignored by most literature on nonparametric statistics. However, this issue should
not be ignored, as it can increase the type I error when comparing two equivalent algorithms
and conversely decrease the power when comparing algorithms whose performance is truly
different. In this technical note, all these inconsistencies of the mean-ranks test will be
discussed in detail and illustrated by means of examples, with the goal of
discouraging its use in machine learning as well as in medicine, psychology, etc.


To avoid these issues, we instead recommend performing the pairwise comparisons of
the post-hoc analysis using the Wilcoxon signed-rank test or the sign test. The decisions of
such tests do not depend on the pool of algorithms included in the initial experiment. It
is understood that, regardless of the specific test adopted for the pairwise comparisons, it is
necessary to control the family-wise type I error. This can be obtained through Bonferroni
correction or through more powerful approaches (Demšar, 2006; Garcia and Herrera, 2008).


Even better would be the adoption of Bayesian methods for hypothesis testing. They
overcome the many drawbacks (Demšar, 2008; Goodman, 1999; Kruschke, 2010) of the
null-hypothesis significance tests. For instance, Bayesian counterparts of the Wilcoxon and of the
sign test have been presented in (Benavoli et al., 2014; Benavoli et al., 2014); a Bayesian
approach for comparing cross-validated algorithms on multiple data sets is discussed by
(Corani and Benavoli, 2015).
2. Friedman test

The performance of multiple algorithms tested on multiple datasets can be organized in a
matrix, with one row per algorithm and one column per dataset:

$$
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1n} \\
X_{21} & X_{22} & \cdots & X_{2n} \\
\vdots & \vdots &        & \vdots \\
X_{m1} & X_{m2} & \cdots & X_{mn}
\end{bmatrix}
\tag{1}
$$

where $X_{ij}$ denotes the performance of the i-th algorithm on the j-th dataset (for $i = 1, \ldots, m$
and $j = 1, \ldots, n$). The observations (performances) in different columns are assumed to be
independent. The algorithms are ranked column-by-column and each entry is replaced
by its rank relative to the other observations in the j-th column:

$$
\begin{bmatrix}
R_{11} & R_{12} & \cdots & R_{1n} \\
R_{21} & R_{22} & \cdots & R_{2n} \\
\vdots & \vdots &        & \vdots \\
R_{m1} & R_{m2} & \cdots & R_{mn}
\end{bmatrix}
\tag{2}
$$


where $R_{ij}$ is the rank of algorithm i in the j-th dataset. The sum of the i-th row,
$R_i = \sum_{j=1}^{n} R_{ij}$, $i = 1, \ldots, m$, depends on how the i-th algorithm performs w.r.t. the
other $(m-1)$ algorithms. Under the null hypothesis of the Friedman test (no difference
between the algorithms) the average value of $R_i$ is $n(m+1)/2$. The statistic of the Friedman
test is

$$
S = \frac{12}{nm(m+1)} \sum_{i=1}^{m} \left[ R_i - \frac{n(m+1)}{2} \right]^2
\tag{3}
$$

which under the null hypothesis has a chi-squared distribution with $m-1$ degrees of freedom.
For m = 2, the Friedman test corresponds to the sign test.
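
The ranking step and the statistic of Equation (3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `rank_column` and `friedman_statistic` and the toy matrix are hypothetical, ties receive the average rank, and no tie correction is applied to the statistic.

```python
def rank_column(values):
    """Rank one dataset's scores: 1 = worst, len(values) = best; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


def friedman_statistic(X):
    """X[i][j] is the performance of algorithm i on dataset j (m rows, n columns).

    Returns the statistic S of Equation (3) and the rank sums R_i.
    """
    m, n = len(X), len(X[0])
    rank_sums = [0.0] * m
    for j in range(n):
        column_ranks = rank_column([X[i][j] for i in range(m)])
        for i in range(m):
            rank_sums[i] += column_ranks[i]
    expected = n * (m + 1) / 2  # average rank sum under the null hypothesis
    squared_dev = sum((r - expected) ** 2 for r in rank_sums)
    return 12 * squared_dev / (n * m * (m + 1)), rank_sums


# Hypothetical toy data: 3 algorithms on 4 datasets, always ordered the same way.
toy = [
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.7, 0.6, 0.8],
    [0.7, 0.6, 0.5, 0.7],
]
S, rank_sums = friedman_statistic(toy)
print(S, rank_sums)
```

The p-value would then be obtained by comparing S with a chi-squared distribution with $m-1$ degrees of freedom, which is left out here to keep the sketch free of external dependencies.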


3. Mean ranks post-hoc test 


If the Friedman test rejects the null hypothesis one has to establish which are the significant
differences among the algorithms. If all classifiers are compared to each other, one has to
perform $m(m-1)/2$ pairwise comparisons.

When performing multiple comparisons, one has to control the family-wise error rate (FWER),
namely the probability of at least one erroneous rejection of the null hypothesis among the
$m(m-1)/2$ pairwise comparisons. In the following example we control the FWER
through the Bonferroni correction, even though more powerful techniques are
also available (Demšar, 2006; Garcia and Herrera, 2008). However, our discussion of the
shortcomings of the mean-ranks test is valid regardless of the specific approach adopted to
control the FWER.

The mean-ranks test claims that the i-th and the j-th algorithm are significantly different
if:

$$
|\bar{R}_i - \bar{R}_j| > z^* \sqrt{\frac{m(m+1)}{6n}}
\tag{4}
$$
where $\bar{R}_i = \frac{1}{n} R_i$ is the mean rank of the i-th algorithm and $z^*$ is the Bonferroni corrected
$\alpha/m(m-1)$ upper standard normal quantile (Gibbons and Chakraborti, 2011, Sec. 12.2.1).
Equation (4) is based on the large sample (n > 10) approximation of the distribution of
the statistic. The actual distribution of the statistic $|\bar{R}_i - \bar{R}_j|$ is derived assuming all the
$(m!)^n$ rank configurations in (2) to be equally probable. Under this assumption the variance of
$\bar{R}_i - \bar{R}_j$ is $m(m+1)/6n$, which originates the term under the square root in (4).

The sampling distribution of the statistic $|\bar{R}_i - \bar{R}_j|$ assumes all rank configurations in
(2) to be equally probable. Yet this assumption is not tenable: the post-hoc analysis is
performed because the null hypothesis of the Friedman test has been rejected.
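
As a minimal sketch of the decision rule in Equation (4), the critical difference can be computed with the standard library alone. The function names `bonferroni_z` and `mean_ranks_significant` are hypothetical, and the rule is the large-sample approximation described above, not an exact test.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_z(alpha, m):
    """Upper alpha/(m(m-1)) standard normal quantile, i.e. z* in Equation (4)."""
    return NormalDist().inv_cdf(1 - alpha / (m * (m - 1)))


def mean_ranks_significant(mean_rank_i, mean_rank_j, m, n, alpha=0.05):
    """Apply |R_i - R_j| > z* sqrt(m(m+1)/(6n)); also return the critical difference."""
    critical_difference = bonferroni_z(alpha, m) * sqrt(m * (m + 1) / (6 * n))
    return abs(mean_rank_i - mean_rank_j) > critical_difference, critical_difference


# Illustrative call with arbitrary mean ranks for m = 5 algorithms and n = 20 datasets.
print(bonferroni_z(0.05, 5))
print(mean_ranks_significant(2.0, 3.5, m=5, n=20))
```

Note that both m and n enter the critical difference, so the threshold applied to a fixed pair of algorithms changes whenever algorithms are added to or removed from the pool.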


4. Inconsistencies of the mean-ranks test 

We illustrate the inconsistencies of the mean-ranks test by presenting three examples. All
examples refer to the analysis of the accuracy of different classifiers on multiple data sets.
We show that the outcome of the test depends both on the actual difference of accuracy
between algorithms A and B and on the accuracy of the remaining algorithms.


4.1 Example 1: artificially increasing power 

Assume we have tested five algorithms A, B, C, D, E on 20 datasets, obtaining the accuracies:

                              Datasets
A   50 50 50 50 50 50 50 50 50 50 80 80 80 80 80 80 80 80 80 80
B   80 80 80 80 80 80 80 80 80 80 50 50 50 50 50 50 50 50 50 50
C   55 55 55 55 55 55 55 55 55 55 45 45 45 45 45 45 45 45 45 45
D   60 60 60 60 60 60 60 60 60 60 85 85 85 85 85 85 85 85 85 85
E   65 65 65 65 65 65 65 65 65 65 90 90 90 90 90 90 90 90 90 90

The corresponding ranks are: 


                 Datasets
A   1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
B   5 5 5 5 5 5 5 5 5 5 2 2 2 2 2 2 2 2 2 2
C   2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
D   3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
E   4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5


where better algorithms are given higher ranks. We aim at comparing A and B. Algorithm
B is better than A in the first ten datasets, while A is better than B in the remaining ten.
The two algorithms have the same mean performance and their differences are symmetrically
distributed. Each algorithm wins on half the data sets. Different types of two-sided tests
(t-test, Wilcoxon signed-rank test, sign test) return the same p-value, p = 1. The mean-ranks
test corresponds in this case to the sign test and thus also its p-value is 1. This is
the most extreme result in favor of the null hypothesis.

Now assume that we compare A, B together with C, D, E. In the first ten datasets,
algorithm A is worse than C, D, E, which in turn are worse than B. In the remaining ten
datasets, C is worse than A, B, which in turn are worse than D, E. The p-value of the
Friedman test is $p \approx 10^{-10}$ and, thus, it rejects the null hypothesis. We can thus perform
the post-hoc test (4) with $z^* = 2.807$ (the Bonferroni corrected $\alpha/m(m-1)$ upper standard
normal quantile for $\alpha = 0.05$ and m = 5). The significance level has been adjusted to
$\alpha/m(m-1)$, since we are performing $m(m-1)/2$ two-sided comparisons. The mean ranks
of A, B are respectively 2 and 3.5 and, thus, since $|\bar{R}_A - \bar{R}_B| = 1.5$ and
$z^* \sqrt{m(m+1)/(6n)} \approx 1.4$,
we can reject the null hypothesis. The result of the post-hoc test is that the algorithms
A, B have significantly different performance.

The decisions of the mean-ranks test are not consistent: 

• if it compares A, B alone, it does not reject the null hypothesis; 

• if it compares A, B together with C, D, E, it rejects the null hypothesis concluding 
that A, B have significantly different performance. 
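
The two decisions listed above can be reproduced from the accuracies of Example 1 with a short sketch. The helper names `mean_ranks` and `mean_ranks_test` are hypothetical; the code assumes no ties within a dataset, which holds for these data, and uses the Bonferroni corrected large-sample rule of Equation (4).

```python
from math import sqrt
from statistics import NormalDist

# Accuracies of Example 1: ten datasets of the first kind, ten of the second.
ACC = {
    "A": [50] * 10 + [80] * 10,
    "B": [80] * 10 + [50] * 10,
    "C": [55] * 10 + [45] * 10,
    "D": [60] * 10 + [85] * 10,
    "E": [65] * 10 + [90] * 10,
}


def mean_ranks(pool):
    """Mean rank of each algorithm within `pool` (higher rank = better, no ties assumed)."""
    n = len(ACC[pool[0]])
    totals = {name: 0 for name in pool}
    for j in range(n):
        ordered = sorted(pool, key=lambda name: ACC[name][j])  # worst first
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return {name: totals[name] / n for name in pool}


def mean_ranks_test(pool, a="A", b="B", alpha=0.05):
    """Return (|R_a - R_b|, critical difference, decision) for the given pool."""
    m, n = len(pool), len(ACC[a])
    ranks = mean_ranks(pool)
    z_star = NormalDist().inv_cdf(1 - alpha / (m * (m - 1)))
    critical = z_star * sqrt(m * (m + 1) / (6 * n))
    diff = abs(ranks[a] - ranks[b])
    return diff, critical, diff > critical


for pool in (["A", "B"], ["A", "B", "C", "D", "E"]):
    diff, critical, significant = mean_ranks_test(pool)
    print(pool, diff, round(critical, 3), significant)
```

The accuracies of A and B are identical in both calls; only the pool passed to the ranking step differs.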

The presence of C, D, E artificially introduces a difference between A, B by changing the 
mean ranks of A, B. For instance, D and E rank always better than A, while they never 
outperform B when it works well (i.e., datasets from one to ten); in a real case study, a 
similar result would probably indicate that while B is well suited for the first ten datasets, 
D, E and A are better suited for the last ten. The difference (in rank) between A and B 
is artificially amplified by the presence of D and E only when B is better than A. The 
point is that a large difference in the global ranks of two classifiers does not necessarily
correspond to a large difference in their accuracies (and vice versa, as we will see in the next
example).

This issue can happen in practice.¹ Assume that a researcher presents a new algorithm
$A_0$ and some of its weaker variations $A_1, A_2, \ldots, A_k$ and compares the new algorithms with
an existing algorithm B. When B is better, the rank is $B \succ A_0 \succ \ldots \succ A_k$. When $A_0$
is better, the rank is $A_0 \succ A_1 \succ \ldots \succ A_k \succ B$. Therefore, the presence of $A_1, A_2, \ldots, A_k$
artificially increases the difference between $A_0$ and B.
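
The size of this effect can be worked out under one additional assumption that is not part of the text above: suppose $A_0$ is the better algorithm on a fraction $q$ of the datasets and B on the remaining fraction $1-q$, with the two orderings exactly as described. With higher ranks given to better algorithms and $m = k+2$ algorithms in the pool, $A_0$ outranks B by $k+1$ positions when it wins and trails it by one position when it loses, so

$$
\bar{R}_{A_0} - \bar{R}_B = q\,(k+1) - (1-q) = q\,(k+2) - 1 .
$$

For $q = 1/2$ the difference of mean ranks is $k/2$: it is zero when $A_0$ and B are compared alone ($k = 0$) and grows linearly with the number of weaker variants added to the pool, although the performances of $A_0$ and B are unchanged.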

4.2 Example 2: low power due to the remaining algorithms 

Assume the performance of algorithms A and B on different data sets to be normally 
distributed as follows: 

$$
A \sim N(0, 1), \quad B \sim N(1.5, 1).
$$

The pool of algorithms comprises also C, D, E, whose performance is distributed as
follows:

$$
C \sim N(5, 1), \quad D \sim N(6, 1), \quad E \sim N(7, 1).
$$

A collection of 20 data sets is considered. 

For the sake of simplicity, assume we want to compare only A and B. There is thus no
need for correction for multiple comparisons.

When comparing A and B, the power of the two-sided sign test with $\alpha = 0.05$ is very
high: 0.94 (we have evaluated the power numerically by Monte Carlo simulation). The
power of the mean-ranks test is instead only 0.046. We can explain the large difference
of power as follows. The sign test (under normal approximation of the distribution of the
statistic) claims significance when:

$$
|\bar{R}_A - \bar{R}_B| > z^* \sqrt{\frac{1}{n}}
$$

while the mean-ranks test (4) claims significance when:

$$
|\bar{R}_A - \bar{R}_B| > z^* \sqrt{\frac{m(m+1)}{6n}} = z^* \sqrt{\frac{5}{n}}
$$

with m = 5. Since the algorithms C, D, E have mean performances that are much larger
than those of A, B, the mean-ranks difference $|\bar{R}_A - \bar{R}_B|$ is equal for the two tests. However,
the mean-ranks test estimates the variance of the statistic to be five times larger
compared to the sign test. The critical value of the mean-ranks test is inflated by $\sqrt{5}$,
largely decreasing the power of the test. In fact, for the mean-ranks test the variance of
$\bar{R}_A - \bar{R}_B$ increases with the number of algorithms included in the initial experiment.

1. We thank the anonymous reviewer for suggesting this example.
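
The power gap in Example 2 can be checked with a small Monte Carlo sketch. The function names, the number of trials and the seed below are arbitrary choices of this sketch, not the settings behind the figures reported above; the sign test is implemented as the mean-ranks rule with m = 2, as in the normal approximation given above.

```python
import random
from math import sqrt
from statistics import NormalDist

# Mean performance of each algorithm in Example 2 (unit standard deviation).
MEANS = {"A": 0.0, "B": 1.5, "C": 5.0, "D": 6.0, "E": 7.0}


def rank_difference(pool, n, rng):
    """Draw n datasets and return the mean rank of B minus the mean rank of A within `pool`."""
    total = 0
    for _ in range(n):
        scores = {name: rng.gauss(MEANS[name], 1.0) for name in pool}
        ordered = sorted(pool, key=scores.get)  # worst first, so index + 1 is the rank
        total += ordered.index("B") - ordered.index("A")
    return total / n


def estimate_power(pool, n=20, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated experiments in which A and B are declared different."""
    rng = random.Random(seed)
    m = len(pool)
    z_star = NormalDist().inv_cdf(1 - alpha / 2)  # single comparison: no Bonferroni correction
    critical = z_star * sqrt(m * (m + 1) / (6 * n))
    hits = sum(abs(rank_difference(pool, n, rng)) > critical for _ in range(trials))
    return hits / trials


print("sign test (pool A, B):       ", estimate_power(["A", "B"]))
print("mean-ranks (pool A, ..., E): ", estimate_power(["A", "B", "C", "D", "E"]))
```

Up to Monte Carlo error, the two estimates should be in the vicinity of the 0.94 and 0.046 quoted in the text; the exact values depend on the seed and the number of trials.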


4.3 Example 3: real classifiers on UCI data sets 


Finally, we compare the accuracies of seven classifiers on 54 datasets. The classifiers are: J48
decision tree ($C_1$); hidden naive Bayes ($C_2$); averaged one-dependence estimator (AODE)
($C_3$); naive Bayes ($C_4$); J48 graft ($C_5$); locally weighted naive Bayes ($C_6$); random forest
($C_7$). The whole set of results is given in the Appendix. Each classifier has been assessed via
10 runs of 10-fold cross-validation. We performed all the experiments using WEKA.² All
these classifiers are described in (Witten and Frank, 2005).

The accuracies are reported in Table 2. Assume that our aim is to compare $C_1, C_2, C_3, C_4$
alone. Therefore, we consider just the first 4 columns in Table 2. The mean ranks are:


$$
C_2 = 2.676, \quad C_4 = 1.917, \quad C_1 = 2.518, \quad C_3 = 2.888.
$$


The Friedman test rejects the null hypothesis. The pairwise comparison for the pair $C_2, C_4$
gives the statistic

$$
z = \frac{|\bar{R}_2 - \bar{R}_4|}{\sqrt{m(m+1)/(6n)}} = 3.06.
$$

Since 3.06 is greater than $z^* = 2.64$ (the Bonferroni corrected $\alpha/m(m-1)$ upper standard
normal quantile for $\alpha = 0.05$ and m = 4), the mean-ranks procedure finds the algorithms
$C_2, C_4$ to be significantly different.

If we compare $C_2, C_4$ together with $C_1, C_5$, the mean ranks are:

$$
C_2 = 2.713, \quad C_4 = 2.102, \quad C_1 = 2.528, \quad C_5 = 2.657.
$$


Again, the Friedman test rejects the null hypothesis. The pairwise comparison for the pair
$C_2, C_4$ gives the statistic

$$
z = \frac{|\bar{R}_2 - \bar{R}_4|}{\sqrt{m(m+1)/(6n)}} = 2.46,
$$

which is smaller than $z^*$. Thus the difference between algorithms $C_2$ and $C_4$ is not
significant.

Pair         Card=2   Card=3   Card=4
C2 vs. C4    7/10     9/10     3/5
C2 vs. C7    1/10     -        -
C3 vs. C7    2/10     -        -
C4 vs. C6    9/10     5/10     -

Table 1: Pairwise comparisons that are affected (number of decisions that are significantly
different/number of subsets) by the performance of the other algorithms. Here
Card=2 means that, for each pair $C_a, C_b$ in the left column, we are considering the subsets
$\{C_a, C_b, C_x, C_y\}$, Card=3 $\{C_a, C_b, C_x, C_y, C_z\}$ and Card=4 $\{C_a, C_b, C_x, C_y, C_z, C_w\}$. The
symbol - means that the comparison does not depend on the subset of algorithms.

2. http://www.cs.waikato.ac.nz/ml/weka/

The accuracies of $C_2$ and $C_4$ are the same in the two cases but again the decisions of
the mean-ranks test are conditional on the group of classifiers we are considering.

Consider building a set of four classifiers $\{C_2, C_4, C_x, C_y\}$. By differently choosing $C_x$
and $C_y$ we can build ten different such sets. For each subset we run the mean-ranks test to
check whether the difference between $C_2$ and $C_4$ is significant. The difference is
claimed to be significant in 7 cases and not significant in 3 cases.

Now consider a set of five classifiers $\{C_2, C_4, C_x, C_y, C_z\}$. By differently choosing $C_x$,
$C_y$ and $C_z$ we can build ten different such sets. This yields 10 further cases in which we
compare again $C_2$ and $C_4$. Their difference is claimed to be significant in 9/10 cases.

Table 1 reports the pairwise comparisons for which the statistical decision changes with
the pool of classifiers that are considered. The outcome of the mean-ranks test when
comparing the same pair of classifiers clearly depends on the pool of alternative classifiers
$\{C_x, C_y, \ldots\}$ which is assumed.
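
The subset enumeration behind such counts can be sketched as follows. To keep the sketch short and self-contained it does not use the accuracies of Example 3; it reuses the five algorithms of Example 1 as a stand-in, so the counts it prints are not those of Table 1. The function names `is_significant` and `count_decisions` are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

# Stand-in data: the accuracies of Example 1 (no ties within a dataset).
ACC = {
    "A": [50] * 10 + [80] * 10,
    "B": [80] * 10 + [50] * 10,
    "C": [55] * 10 + [45] * 10,
    "D": [60] * 10 + [85] * 10,
    "E": [65] * 10 + [90] * 10,
}


def is_significant(pool, a, b, alpha=0.05):
    """Mean-ranks decision for the pair (a, b) when ranks are computed within `pool`."""
    m, n = len(pool), len(ACC[a])
    rank_diff_sum = 0
    for j in range(n):
        ordered = sorted(pool, key=lambda name: ACC[name][j])  # worst first
        rank_diff_sum += ordered.index(a) - ordered.index(b)
    z_star = NormalDist().inv_cdf(1 - alpha / (m * (m - 1)))
    return abs(rank_diff_sum / n) > z_star * sqrt(m * (m + 1) / (6 * n))


def count_decisions(a, b, card, alpha=0.05):
    """Count significant decisions over all pools {a, b} plus `card` other algorithms."""
    others = [name for name in ACC if name not in (a, b)]
    subsets = list(combinations(others, card))
    significant = sum(is_significant([a, b, *extra], a, b, alpha) for extra in subsets)
    return significant, len(subsets)


for card in range(4):
    print("Card =", card, count_decisions("A", "B", card))
```

Each printed pair has the same meaning as an entry of Table 1: the number of pools in which the pair is declared different, over the number of pools of that size.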


4.4 Maximum type I error 

A further drawback of the mean-ranks test which has not been discussed in the previous
examples is that it cannot control the maximum type I error, that is, the probability of falsely
declaring any pair of algorithms to be different regardless of the other $m-2$ algorithms.
If the accuracies of all algorithms but one are equal, it does not guarantee the family-wise
Type I error to be smaller than $\alpha$ when comparing the $m-1$ equivalent algorithms. We
point the reader to (Fligner, 1984) for a detailed discussion on this aspect.


5. A suggested procedure 

Given the above issues, we recommend avoiding the mean-ranks test for the post-hoc
analysis. One should instead perform the multiple comparison using tests whose decisions depend
only on the two algorithms being compared, such as the sign test or the Wilcoxon signed-rank
test. The sign test is more robust, as it only assumes the observations to be identically
distributed. Its drawback is low power. The Wilcoxon signed-rank test is more powerful
and thus it is generally recommended (Demšar, 2006). Compared to the sign test, the
Wilcoxon signed-rank test makes the additional assumption of a symmetric distribution of
the differences between the two algorithms being compared. The decision between sign test
and signed-rank test thus depends on whether the symmetry assumption is tenable for
the analyzed data.

Regardless of the adopted test, the multiple comparisons should be performed adjusting
the significance level to control the family-wise Type I error. This can be done using the
corrections for multiple comparisons discussed by (Demšar, 2006; Garcia and Herrera, 2008).
If we adopt the Wilcoxon signed-rank test in Example 3 for comparing $C_2, C_4$, we obtain
the p-value 0.0002, independently from the performance of the other algorithms. Thus,
for any pool of algorithms $C_2, C_4, C_x, C_y$, we always report the same decision: $C_2, C_4$ are
significantly different because the p-value is less than the Bonferroni corrected significance
level $\alpha/m(m-1)$ (in the case m = 4, $\alpha/m(m-1) = 0.0042$).
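
A minimal sketch of the suggested procedure is given below, using only the Python standard library. It is not the MATLAB code accompanying this note: the function names are hypothetical, the sign test is the exact two-sided binomial version with ties discarded, and the Wilcoxon signed-rank test uses the normal approximation without tie or continuity correction. Both functions take only the scores of the two algorithms being compared, so their decisions cannot depend on the rest of the pool.

```python
from math import comb, sqrt
from statistics import NormalDist


def sign_test_p(x, y):
    """Two-sided exact sign test on paired scores; tied pairs are discarded."""
    wins = sum(a > b for a, b in zip(x, y))
    losses = sum(a < b for a, b in zip(x, y))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilcoxon_p(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, zero differences dropped)."""
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Tied absolute differences share the average rank.
        while end + 1 < n and abs(d[order[end + 1]]) == abs(d[order[pos]]):
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no tie correction in this sketch
    z = (w_plus - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))


def bonferroni_level(alpha, m):
    """Corrected significance level alpha / (m(m-1)) used in the text."""
    return alpha / (m * (m - 1))


# Algorithms A and B of Example 1: each wins on half of the datasets.
A = [50] * 10 + [80] * 10
B = [80] * 10 + [50] * 10
print(sign_test_p(A, B), wilcoxon_p(A, B), bonferroni_level(0.05, 4))
```

A pair is declared significantly different when its p-value falls below `bonferroni_level(alpha, m)`; for small samples or heavily tied data an exact signed-rank test would be preferable to the approximation used here.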


6. Software 

The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/ma-


7. Conclusions 


The mean-ranks post-hoc test is a widely used test for multiple pairwise comparisons. We
discuss a number of drawbacks of this test, which we recommend avoiding. We instead
recommend adopting the sign test or the Wilcoxon signed-rank test, whose decisions do not
depend on the pool of classifiers included in the original experiment.

We moreover bring to the attention of the reader the Bayesian counterparts of these
tests, which overcome the many drawbacks (Kruschke, Chap. 11) of null-hypothesis
significance testing.
Table of accuracies used in example 3


Dataset                  C1     C2     C3     C4     C5     C6     C7
anneal                   98.44  98     98     96.43  98.55  98.33  99
audiology                78.32  73.42  71.66  71.23  78.32  77.41  73.89
wisconsin-breast-cancer  93.7   96.71  96.99  97.14  93.7   97.28  95.57
cmc                      50.71  52.81  51.39  51.05  50.78  50.98  48.67
contact-lenses           81.67  68.33  71.67  71.67  81.67  65     78.33
credit                   86.38  84.64  86.67  86.23  86.52  87.25  85.07
german-credit            72.4   76.6   76.6   76     72.4   75.3   73
pima-diabetes            73.7   74.09  75.01  74.36  73.56  74.75  72.67
ecoli                    81.52  80.04  81.83  82.12  81.52  80.63  78.84
eucalyptus               64.28  63.2   58.71  51.1   64.01  59.52  59.4
glass                    71.58  74.26  73.83  70.63  71.1   75.69  73.33
grub-damage              38.79  36.88  43.92  47.79  39.42  40.13  42.63
haberman                 72.87  71.53  72.52  72.52  72.87  73.52  72.16
hayes-roth               60     56.88  60     60     60     60     59.38
cleveland-14             78.82  81.47  81.8   83.44  78.48  82.78  81.81
hungarian-14             78.64  84.39  84.39  84.74  78.64  84.38  81.97
hepatitis                79.46  85.13  83.79  82.5   79.46  82.5   81.25
hypothyroid              99.28  99.18  98.54  98.3   99.28  98.62  98.97
ionosphere               91.17  90.88  90.88  89.17  91.74  89.17  91.75
iris                     93.33  92     92.67  92.67  93.33  92     93.33
kr-vs-kp                 99.44  92.46  91.24  87.89  99.37  91.21  98.87
labor                    85     88     84.67  83     85     81.33  84.67
liver-disorders          56.25  56.25  56.25  56.25  56.25  56.25  56.25
lymphography             78.33  85     85.71  84.38  79     86.33  79.62
monks1                   98.74  100    85.44  74.64  98.74  82.21  98.56
monks3                   98.92  97.84  96.75  96.39  98.92  96.39  97.84
monks                    64.72  64.57  63.73  62.24  64.72  64.9   70.72
mushroom                 100    99.96  99.95  95.83  100    99.84  100
nursery                  97.05  94.28  92.71  90.32  97.08  91.61  98.09
optdigits                78.97  96.17  96.9   92.3   81.01  94.2   91.8
page-blocks              96.62  96.84  96.95  93.51  96.66  94.15  96.97
pasture-production       75     85.83  80.83  80.83  75     81.67  75.83
pendigits                89.05  97.61  97.82  87.78  89.87  94.81  95.67
postoperative            70     67.78  67.78  66.67  70     66.67  60
primary-tumor            40.11  48.08  47.49  46.89  40.11  49.55  38.31
segment                  94.24  96.36  94.5   91.3   94.03  94.29  96.06
solar-flare-C            88.86  88.24  88.54  86.08  88.86  87.92  86.05
solar-flare-m            90.1   87.02  87.92  87     90.1   86.99  85.46
solar-flare-X            97.84  97.53  97.84  93.17  97.84  94.41  95.99
sonar                    74.48  79.83  81.26  80.29  74.45  80.79  78.36
soybean                  92.39  94.58  93.4   92.08  92.98  93.55  92.68
spambase                 92.81  92.31  93.37  89.85  93.22  90.63  93.65
spect-reordered          78.29  82.07  80.93  79.03  78.29  83.15  80.56
splice                   94.36  96.18  96.21  95.36  94.2   95.89  89.37
squash-stored            70     58     60     61.67  70     63.67  57.67
squash-unstored          76.67  69     70.67  61.67  76.67  68.67  77.33
tae                      47     44.38  47     47     47     47     45.67
credit                   84.93  83.91  85.07  84.2   84.93  85.22  83.33
vowel                    76.67  84.65  77.78  60.3   76.87  77.88  84.95
waveform                 74.38  84.52  84.92  79.86  74.9   83.62  79.68
white-clover             56.9   79.29  68.57  66.9   56.9   64.76  70
wine                     88.79  98.33  98.33  98.89  89.35  98.33  97.22
yeast                    57.01  57.48  56.74  56.8   57.01  57.48  56.26
zoo                      92.18  100    95.09  93.18  92.18  96.18  95.09


Table 2: Accuracy of classifiers on different data sets.